Specializing I/O access patterns for flash storage

ABSTRACT

Systems and methods for efficiently using solid-state devices are provided. Some embodiments provide for a data processing system that uses a non-volatile solid state device as a circular log, with the goal of aligning data access patterns to the underlying, hidden device implementation, in order to maximize performance. In addition, metadata can be interspersed with data in order to align data access patterns to the underlying device implementation. Multiple input/output (I/O) buffers can also be used to pipeline insertions of metadata and data into a linear log. The observed queuing behavior of the multiple I/O buffers can be used to determine when the utilization of the storage device is approaching saturation (e.g., in order to predict excessively-long response times). Then, the I/O load on the storage device may be shed when utilization approaches saturation. As a result, the overall response time of the system is improved.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 13/477,966, entitled “SPECIALIZING I/O ACCESS PATTERNS FOR FLASH STORAGE”, filed on May 22, 2012 by Christopher Small et al., now issued as U.S. Pat. No. ______ on ______, which application is hereby incorporated by reference.

TECHNICAL FIELD

Various embodiments disclosed herein generally relate to solid-state storage devices. More specifically, some embodiments relate to systems and methods for optimizing writes to solid-state storage devices.

BACKGROUND

Various types of non-volatile storage media such as, for example, relatively high latency (i.e., longer access times) hard disk drive devices (HDDs) and relatively low latency (i.e., shorter access times) solid-state devices (SSDs) such as flash memory or DRAM can be used for storing information. HDDs generally provide good streaming performance (e.g., reading of large sequential blocks or “track reads”) but do not perform well on random access (i.e., reading and writing of individual disk sectors) due to slow access times. SSDs, on the other hand, are more suitable for random and frequent memory accesses because of their relatively low latency. With no moving parts, SSDs do not have the mechanical delays that cause the high latency experienced by HDDs, and seek time is decreased significantly, making SSDs very fast.

Flash memory is generally accepted as a new tier in the memory hierarchy between DRAM and disk. In terms of cost per gigabyte (GB), DRAM capacity is more expensive than flash capacity, which is more expensive than hard disk capacity. At the same time, DRAM latencies are lower than flash latencies, and flash latencies are lower than hard disk latencies. As a result, the cost per input/output (I/O) operation of flash memory is between that of DRAM and that of magnetic media. This placement in the memory hierarchy often makes flash memories ideal for caching.

While flash and other solid-state memories sometimes provide the same interface as a SCSI or SATA drive, the underlying operation, implementation, and performance of solid-state memories and SCSI or SATA drives may differ substantially. For example, one of the primary differences is that storage locations in SSDs need to be erased before information can be written to them. The device is typically erased in units (erase blocks) larger than a traditional write unit (sectors). Even with these operational differences, SSDs often use the input/output interfaces developed for HDDs. As a result, the integration and implementation of SSD memories based on algorithms developed for HDDs may not capture the full benefit of SSD memories, since the algorithms are not optimized for the characteristics of SSD-based storage. Consequently, improved techniques are needed to employ flash memory and other solid-state devices more effectively.

SUMMARY

Various embodiments introduced here generally relate to systems and methods for customizing input/output (I/O) access patterns for a flash or other non-volatile solid-state storage device. Some embodiments use this customization to create a flash-friendly caching algorithm. These techniques, together with various associated components and operations, are able to more efficiently utilize non-volatile solid-state devices (e.g., flash devices, battery-backed RAM, and others). For example, unlike hard-disk drives, erasing data from a solid-state drive typically takes more time than writing. As a result, performing an erase to recover one free sector is less valuable than performing an erase to recover a full erase block of sectors. Therefore, in some embodiments described herein, writes are performed to the solid-state storage device only in integral multiples of erase blocks. In addition, hard-disk drives often segregate metadata from data to minimize seek time. However, since seek time is not a concern for solid-state devices, metadata may be commingled with the data in many embodiments.

In certain embodiments, a non-volatile solid-state drive has an associated translation layer to map logical sector addresses to physical addresses in the non-volatile solid-state drive, the non-volatile solid-state drive is treated as a circular log for storing data (e.g., in a host-side cache), and the data are written to the non-volatile solid-state drive in write units the sizes of which are integer multiples of a size of an erase block of the non-volatile solid-state drive. Since at least some non-volatile solid-state drives must be erased in erase blocks before writing, this combination of techniques, and variations on it, aligns data access patterns to the underlying, hidden device implementation in order to improve performance.

Embodiments of the present invention also include other methods, systems with various components, and computer-readable storage media containing sets of instructions to cause one or more processors to perform the methods, variations of the methods, and other operations described herein. While multiple embodiments are disclosed, still other embodiments of the present invention will become apparent to those skilled in the art from the following detailed description, which shows and describes illustrative embodiments of the invention. As will be realized, the invention is capable of modifications in various aspects, all without departing from the scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention will be described and explained through the use of the accompanying drawings in which:

FIG. 1 shows a block diagram of a processing system in which some embodiments of the techniques introduced here may be implemented or utilized;

FIG. 2 is a block diagram illustrating components of a non-volatile solid-state memory device;

FIG. 3 is a block diagram illustrating examples of sectors, pages, and erase blocks of a non-volatile solid-state memory device;

FIG. 4 is a flow chart illustrating a process for processing a write request submitted to a non-volatile solid-state memory device;

FIG. 5 is a flow chart illustrating a process for operating a non-volatile solid-state memory device based on a page replacement policy;

FIG. 6 is a flow chart illustrating a process for shedding write requests and/or read requests from a queue associated with a non-volatile solid-state memory device; and

FIG. 7 is a flow chart illustrating a process for improving the performance of host-side non-volatile solid-state device caches by bypassing the non-volatile solid-state storage devices.

The drawings have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be expanded or reduced to help improve the understanding of the embodiments of the present invention. Similarly, some components and/or operations may be separated into different blocks or combined into a single block for the purposes of discussion of some of the embodiments of the present invention. Moreover, while the invention is amenable to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and are described in detail below. The intention, however, is not to limit the invention to the particular embodiments described. On the contrary, the invention is intended to cover all modifications, equivalents, and alternatives falling within the scope of the invention as defined by the appended claims.

DETAILED DESCRIPTION

Flash-based solid-state storage devices (SSDs) and other SSDs provide a random-access interface and sometimes masquerade as a SCSI or SATA disk drive. However, the underlying implementation and performance profile differ substantially from those of disk drives. A disk drive, in the common case, will have physical sectors on the disk that are arranged in the same order as the logical sectors exported by the random access interface. For example, sector n is physically adjacent to sector n+1, and writing sectors n . . . (n+m) will take about half as long as writing sectors n . . . (n+2m). The time it takes to perform a disk write is effectively deterministic and induced by the size of the write and physical characteristics and state of the disk (i.e., the number of sectors written, the starting position of the disk head, and the rotational speed of the platter). Once the disk head is positioned appropriately, the time to write sectors n . . . (n+m) is determined by the physical characteristics of the drive.

Although a flash drive commonly provides the same interface as a disk drive, the implementation is very different. One of the primary differences is that flash memory needs to be erased before writing, and erased in units (erase blocks) larger than a traditional write unit (sectors). To effectively masquerade as a disk drive, a conventional flash drive uses flash translation layer (FTL) software. An FTL maps a set of logical sector addresses to a substantially larger physical storage pool, typically making no attempt to map logical sector n to physical sector n. As data is written to a flash drive, the FTL stores the data in currently-unallocated physical storage, retaining a mapping from the logical sector number associated with the data to the physical sector holding that data. When a logical sector is rewritten, the FTL makes no attempt to store the new data in the same physical sector; instead it marks the old logical-to-physical mapping as invalid and the old physical sector as available for reuse, and writes the new data to a different physical sector. If this process were to continue, it would eventually result in all of the physical storage being filled.
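By way of a non-limiting illustration, the remapping behavior described above may be sketched as follows. The class name TinyFTL, its fields, and the choice to hand out free physical sectors in ascending order are assumptions made only for this sketch; an actual FTL is considerably more involved and is not specified here.

```python
class TinyFTL:
    """Toy flash translation layer (hypothetical): logical -> physical sector."""

    def __init__(self, physical_sectors):
        self.free = list(range(physical_sectors))  # currently-unallocated physical sectors
        self.mapping = {}                          # logical sector -> physical sector
        self.invalid = set()                       # stale physical sectors awaiting erase
        self.flash = {}                            # physical sector -> stored data

    def write(self, logical, data):
        if not self.free:
            raise RuntimeError("no free physical sectors; cleaning is required")
        old = self.mapping.get(logical)
        if old is not None:
            # A rewrite never updates in place: the old copy is only marked stale.
            self.invalid.add(old)
        new = self.free.pop(0)
        self.flash[new] = data
        self.mapping[logical] = new
        return new

    def read(self, logical):
        return self.flash[self.mapping[logical]]


ftl = TinyFTL(physical_sectors=4)
first = ftl.write(7, b"a")
second = ftl.write(7, b"b")  # same logical sector, different physical sector
```

In this sketch, rewriting logical sector 7 consumes a second physical sector and leaves the first one invalid, which is why the physical pool eventually fills unless cleaning is performed.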

To keep flash drives from filling completely and to make storage available, SSDs are typically provisioned with more physical storage than they offer as logical storage. In addition, the FTL performs cleaning (sometimes called “garbage collection”). Typically the cleaning process includes choosing a region of the physical storage that is one or more erase blocks, determining which of the sectors in the region are still valid, copying the valid sectors to an available region of the storage, updating the logical-to-physical mapping of those sectors, and then performing a bulk erase of the region. Each invalid sector in the region results in a net increase in available physical sectors.
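The cleaning steps above may be illustrated with the following minimal sketch. The function clean_region and its arguments are hypothetical names introduced for illustration, and the sketch assumes that every sector of the region that is not valid is reclaimable.

```python
def clean_region(region, valid, free_target):
    """Clean one region of physical sectors (hypothetical helper).

    region:      list of physical sector numbers covering whole erase blocks
    valid:       dict of physical sector -> logical sector for still-valid data
    free_target: list of free physical sectors outside the region

    Returns (relocations, net_gain), where relocations maps each logical
    sector to its new physical sector and net_gain is the number of
    physical sectors recovered by the bulk erase.
    """
    relocations = {}
    for phys in region:
        if phys in valid:
            dest = free_target.pop(0)          # copy the valid sector elsewhere
            relocations[valid[phys]] = dest    # update logical -> physical mapping
    # The bulk erase frees the whole region, but the copies consumed some space.
    net_gain = len(region) - len(relocations)
    return relocations, net_gain
```

For a four-sector region holding two valid sectors, this sketch relocates two sectors and reports a net gain of two, matching the observation that only the invalid sectors contribute free space.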

Unfortunately, the erase operation consumes considerable time, substantially more than a write operation. As a result, the time to write to a flash drive depends on how “dirty” the drive is (how much free space is currently available), how that space is distributed across the drive, how aggressive the FTL is in performing cleaning in the background, how much idle time the drive has, and a host of other factors.

Various embodiments of the techniques presented here reduce the amount of necessary housekeeping. In some embodiments, techniques are disclosed for writing metadata and data to a flash storage device in ways that provide high performance (e.g., incurring low time overhead or low latency) from the device. One application of these techniques is in a cache (e.g., a host-side flash cache, or other types of flash caches such as in a network, virtual machine hypervisor on a server, etc.).

Some embodiments provide for a data storage system that uses a non-volatile solid-state storage device (e.g., a flash storage device) as a circular log, with the goal of aligning data access patterns to the underlying, hidden device implementation, in order to maximize performance. For example, the write blocks can be an integral number of the erase blocks of the device. In addition, metadata (descriptive data) can be interspersed with data into a linear log in order to align data access patterns to the underlying device implementation. Some embodiments also use a technique of cleaning the linear log entries using read-ahead or avoidance depending on the replacement algorithm. Multiple I/O buffers can also be used to pipeline insertions of metadata and data into a linear log. The observed queuing behavior of the multiple I/O buffers can be used to determine when the utilization of the storage device is approaching saturation (e.g., in order to predict excessively-long response times). Then, the I/O load on the storage device may be shed to a backing store containing identical copies of the data when utilization approaches saturation. As a result, the overall response time of the system is improved.

FIG. 1 shows a block diagram of a processing system 100 in which some embodiments of the techniques introduced here may be utilized. In the embodiments shown in FIG. 1, processing system 100 includes a host 110 having a virtual machine 112 using hypervisor 115 to interact with storage server 120. As illustrated in FIG. 1, storage server 120 includes one or more processors 122, a memory 124 with buffer cache 138, a network adapter 126, and a storage adapter 128 interconnected by a system bus 125.

Hypervisor 115 is a virtual machine manager (VMM) that allows multiple operating systems to run concurrently on a host computer. Hypervisor 115 presents a virtual operating platform to the operating systems. In many cases, multiple instances of a variety of operating systems may share the virtualized hardware resources. In various embodiments of the present invention, to enable cache consistency through live virtual machine migrations and data management operations on virtual disks, a non-volatile solid-state cache using SSD 117A (e.g., PCIe flash card) attached to host 110 may be used by hypervisor 115. The techniques disclosed for writing metadata and data can be used with the solid-state cache in ways that improve performance (e.g., lowering overhead time or latency).

Host(s) 110 and virtual machine(s) 112 may each interact with the storage server 120 in accordance with a client/server model of information delivery. That is, the host(s) 110 may request the services of the storage server 120 and the system may return the results of the services requested by the host 110, such as by exchanging packets over the network 160. The virtual host(s) 110 may issue packets including file-based access protocols such as the Common Internet File System (CIFS) protocol or Network File System (NFS) protocol over TCP/IP when accessing information in the form of files. Alternatively, the host(s) 110 may issue packets including block-based access protocols such as the Small Computer Systems Interface (SCSI) protocol encapsulated over TCP (iSCSI) and SCSI encapsulated over Fibre Channel Protocol (FCP) when accessing information in the form of LUNs or blocks.

The storage server 120 can be a computer that provides storage services relating to the organization of information on writable, persistent storage media, such as SSDs and HDDs. The storage server 120 also includes a storage operating system that implements a file system to logically organize the stored data as a hierarchical structure of logical data containers (e.g., volumes, logical units, directories and/or files) on the electronic storage media 140 and magnetic storage media 150.

It will be understood by those skilled in the art that the techniques introduced here may apply to any type of special-purpose computer (e.g., file server or storage serving appliance) or general-purpose computer embodied as, or having a storage server including a standalone computer or portion thereof. While FIG. 1 illustrates a monolithic, non-distributed storage server 120, various embodiments are applicable to other types of storage configurations (e.g., cluster storage configurations). Moreover, the teachings of this description can be adapted to a variety of storage server architectures including, but not limited to, a network-attached storage (NAS), storage area network (SAN), or a storage device assembly directly-attached to a client or host computer. The term “storage server” should therefore be taken broadly to include such arrangements, including a storage server that provides file-based access to data, block-based access to data, or both.

Memory 124 includes storage locations that are addressable by the processor(s) 122 and adapters and can store software programs and data structures to carry out the techniques described herein. Processor(s) 122 and adapters may, in turn, include processing elements and/or logic circuitry configured to execute the software programs and manipulate the data structures. A storage operating system, portions of which may be resident in memory 124 and may be executed by the processor(s) 122, functionally organizes the storage server by invoking storage operations in support of software processes executing on the server 120. It will be apparent to those skilled in the art that other memory mechanisms, such as various computer-readable media, may instead be used to store and execute program instructions pertaining to the embodiments described herein. The electronic storage media 140 and magnetic storage media 150 are configured to provide a persistent, writable storage space capable of maintaining data in the event of a power loss or other failure of the storage server 120. Accordingly, the electronic storage media 140 and magnetic storage media 150 may be embodied as large-volume memory arrays.

The network adapter 126 includes the circuitry and mechanical components needed to connect the storage server 120 to a host 110 over a network 160, which may include a point-to-point (P2P) connection or a shared medium. Network 160 can be any group of interconnected devices capable of exchanging information. In some embodiments, network 160 may be as few as several personal computers, special-purpose computers, and/or general-purpose computers on a Local Area Network (LAN) or as large as the Internet. In some cases, network 160 may be comprised of multiple networks (private and/or public), even multiple heterogeneous networks, such as one or more border networks, broadband networks, service provider networks, Internet Service Provider (ISP) networks, and/or Public Switched Telephone Networks (PSTNs), interconnected via gateways operable to facilitate communications between and among the various networks.

The storage adapter 128 cooperates with the storage operating system executing on the storage server 120 to access information requested by the host 110. The information may be stored on the electronic storage media 140 and magnetic storage media 150, which are illustratively embodied as SSDs and HDDs. The storage adapter includes I/O interface circuitry that couples to the SSD 140 and HDD 150 over an I/O interconnect arrangement, such as a conventional high-performance Fibre Channel serial link topology. The information is retrieved by the storage adapter 128 and, if necessary, processed by the processor(s) 122 (or the adapter 128) prior to being forwarded over the system bus 125 to the network adapter 126 where the information is formatted into a packet and returned to the host 110.

In the illustrated embodiments, buffer cache 138 is part of the memory 124. However, this is by way of example and not of limitation as the buffer cache 138 may be coupled with the memory using, for example, a point-to-point connection. In addition, the buffer cache 138 may be separate from the memory 124, part of the memory 124, or part of the processor(s) 122. Generally, a buffer cache memory, such as buffer cache 138, includes a smaller, lower-latency (faster) memory such as RAM (e.g., DRAM), operable to reduce the average time to perform a memory access. The buffer cache typically stores copies of the data from the most frequently used locations in memory 124 so that when a memory access is performed, the buffer cache may first be checked to determine if required data is located therein, and, if so, the data may be accessed from the buffer cache 138 instead of the persistent storage media, such as SSDs or HDDs. In this manner, a buffer cache, such as buffer cache 138, reduces memory access times by avoiding having to access persistent storage to obtain the data.

FIG. 2 is a block diagram illustrating components of a non-volatile solid-state memory device 200. The non-volatile solid-state memory device 200 may be used in any of a variety of places, and for any of a variety of purposes, within a storage environment. For example, the non-volatile solid-state memory device 200 may be a host-side cache or a cache for disk backing storage within processing system 100. The embodiment of the non-volatile solid-state memory device 200 illustrated in FIG. 2 includes a controller 210 running FTL software 220, temporary storage device 230, and flash memory chips 240A-240H each having multiple cells for storing data. Controller 210 accepts and responds to requests coming via a bus (e.g., PCIe, SATA, SAS, FC, or other bus) and interacts with the flash memory chips 240A-240H. As such, controller 210 typically includes the electronics (e.g., embedded processor) that bridge the flash memory chips 240A-240H (e.g., NAND memory components) to a host/client computer, bus, or server.

Controller 210 may perform a variety of functions including, but not limited to, garbage collection, encryption, caching (both read and write), error correction, and others. Typically, the FTL software 220 can include software running on the controller (e.g., on an embedded processor). FTL 220 maps logical (client) sector and page addresses to physical (internal) page addresses. As new data is sent to the flash drive, FTL 220 finds the appropriate unused space on the drive's flash memory chips 240A-240H and stores the data there. In many embodiments, FTL 220 uses an internal data structure to track the correspondence between logical and physical addresses.

Each memory cell in flash memory chips 240A-240H may be in a free state, a used state, or an invalid state. The free state indicates that the memory cell is not storing any data. The used state indicates that the memory cell is currently storing some data. The invalid state indicates that the data stored in the memory cell is no longer valid. A cell marked as invalid must be erased before new data can be written to that cell. However, most non-volatile solid-state memory devices can only be erased in blocks as described in more detail in FIG. 3.

FIG. 3 is a block diagram illustrating examples of sectors, pages, and erase blocks associated with non-volatile solid-state storage devices. A typical conventional SSD, such as a flash device, allows clients to read/write 512-byte sectors 310 but is implemented in terms of pages 320 (usually 2 KB, 4 KB, or 8 KB). Most SSDs must have the target storage locations erased to make free space before any writing can occur to those locations. One example is NAND flash. The minimum unit of erasure is referred to as an erase block 330. An erase block could be 32, 64, 128, or other number of pages.
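To make the relationship among sectors, pages, and erase blocks concrete, the following sketch computes how many client-visible sectors one erase block spans. The function name and the particular geometry used in the example (4 KB pages, 64 pages per erase block) are assumptions chosen from the ranges mentioned above, not properties of any specific device.

```python
def sectors_per_erase_block(page_bytes, pages_per_block, sector_bytes=512):
    """Number of client-visible sectors covered by one erase block."""
    if page_bytes % sector_bytes:
        raise ValueError("page size must be a whole number of sectors")
    return (page_bytes // sector_bytes) * pages_per_block


# Assumed example geometry: 4 KB pages, 64 pages per erase block.
example = sectors_per_erase_block(page_bytes=4096, pages_per_block=64)
```

Under these assumed values an erase block spans 512 sectors (256 KB), so a single-sector rewrite touches a very small fraction of the unit that must eventually be erased.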

Since erasing data from an SSD typically takes more time than writing data to the SSD, improving the efficiency in the time the drive spends performing the erase operations can result in significant performance improvements. For example, performing an erase to recover one free sector is less valuable than performing an erase to recover a full erase block of sectors. Various embodiments of the techniques introduced here align the blocks being written with blocks that are being invalidated. In some embodiments, an integral multiple i of an erase block of sectors can be written out at a time. A region the size of i erase blocks can be substantially simultaneously invalidated and allocated. This will result in erase operations that have full value.

The size of an erase block of the memory device may not always be initially known. Some embodiments estimate the size of the erase block using a variety of information and factors such as device type, manufacturer, etc. The estimates start with an erase block that is 2^j sectors in size for some j. Then, a write block of 2^k sectors may be written to the device, where k is chosen to be large enough to be at least as large as j. This will result in the drive invalidating, erasing, and writing 2^(k-j) erase blocks on each operation. This is in contrast to conventional methods of operating non-volatile solid-state memories where erasing can occur to recover one writing sector or a number of sectors that are less than the size of an erase block.
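The arithmetic relating the write-block exponent k to the estimated erase-block exponent j can be sketched as follows. The helper names are hypothetical, and the alignment check assumes that write blocks begin at logical sector 0 of the device.

```python
def erase_blocks_per_write(j, k):
    """Erase blocks covered by one write of 2**k sectors when an erase
    block is estimated at 2**j sectors (requires k >= j)."""
    if k < j:
        raise ValueError("write block must be at least one erase block")
    return 1 << (k - j)


def is_write_block_aligned(start_sector, k):
    """True if start_sector falls on a 2**k-sector boundary (assumed layout)."""
    return start_sector % (1 << k) == 0
```

For instance, with j = 8 (256-sector erase blocks) and k = 10 (1024-sector write blocks), each write covers exactly four erase blocks. Choosing k somewhat larger than the estimated j is one way such a scheme can tolerate an underestimate of the erase block size.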

Traditionally, metadata (descriptive data) is written and stored separately from the data it describes. With disk storage, the metadata can be stored in contiguous regions so that the metadata can be read in a small number of I/Os, with a small number of disk seek operations required. SSDs, however, do not exhibit seek time delays, and so that motivation does not apply. In some embodiments, the metadata may be commingled with the data.

Various embodiments may delay the write of a particular data item or metadata item to the non-volatile solid-state storage device for a period of time and buffer writes until 2^k sectors of data and metadata have been accumulated. In some cases, this may not be desirable. One example is a storage system which uses flash storage as a cache for disk-backing storage and which needs to keep the cache state intact through unscheduled interruptions such as power failures or system crashes. In this case, changes to the state of the cache as reflected in the data and metadata to be stored on the flash storage must be committed to the flash storage immediately, before acknowledging completion of the storage system read or write operation which initiated the modification of the state of the cache.

Some non-volatile solid-state storage devices contain a power-protected RAM write buffer. This buffer allows the flash device to quickly complete a write request without actually writing data to flash storage. The logic in the non-volatile solid-state storage device that is controlling the power-protected RAM write buffer has a policy controlling when the buffer is written to flash storage (flushed). Typically the buffer is flushed when a write request is received which is not sequentially following the previous writes, or which otherwise does not write to an address near the currently-buffered writes. If the buffer is flushed while containing less than a full erase block, the performance advantages discussed above may be lost.

In some embodiments, when a partially-filled region of 2^k sectors must be committed to flash storage and the flash storage device contains a power-protected RAM write buffer, additional techniques may be used. For example, the entirety of the partially-filled region may be written to the non-volatile solid-state storage device. The write may start at the address which would normally be used for a completely filled region. As the partially-filled region fills, subsequent commits continue to write the entirety of the partially-filled region to the non-volatile solid-state storage device. Each subsequent write may start at the next address which would normally be used for a completely filled region even though the entire region will not be completely filled. This results in overwriting the beginning sectors of the partially-filled region, but non-volatile solid-state storage devices with a compatible power-protected buffer flush policy will not flush their buffer until an entire erase block has been written.

In some embodiments, only the newly-modified sectors of the partially-filled region are written to the non-volatile solid-state storage device. They are written to the same addresses as if the entire region had been written. This results in overwriting at most only the last sector of the last write, but non-volatile solid-state storage devices with a compatible power-protected buffer flush policy will not flush their buffer until an entire erase block has been written.

In other embodiments, a coarse-grained technique may be used to provide durability. A well-known sector on the non-volatile solid-state storage device may be chosen to contain a “dirty shutdown” indicator indicating an uncontrolled shutdown of the storage device (e.g., from a loss of power). The dirty shutdown indicator may be as small as a single bit. When the storage device initializes, the dirty shutdown indicator is persistently set to the true state by synchronously writing its containing sector. During an orderly shutdown, I/O to the non-volatile solid-state storage device is quiesced. Once all outstanding I/Os to the non-volatile solid-state storage device have completed, the dirty shutdown indicator is set to false. If an unscheduled interruption shuts down the storage device unexpectedly, the dirty shutdown indicator would be in the true state. To determine if the contents of the non-volatile solid-state storage device can be trusted, the dirty shutdown indicator may be used. A false value would indicate that the contents of the non-volatile solid-state storage device could be trusted. A true value would indicate that the contents of the non-volatile solid-state storage device are potentially inconsistent and therefore should be ignored.
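The life cycle of the dirty shutdown indicator may be illustrated with the following sketch. The class DirtyShutdownFlag is hypothetical, the well-known sector is modeled as an in-memory bytearray standing in for a synchronous device write, and reading the indicator before setting it at initialization is an assumption of this sketch rather than a stated requirement.

```python
class DirtyShutdownFlag:
    """Hypothetical model of a one-byte dirty shutdown indicator."""

    def __init__(self, sector):
        # 'sector' stands in for the well-known sector on the device.
        self.sector = sector

    def contents_trusted(self):
        return self.sector[0] == 0  # false indicator -> contents can be trusted

    def on_initialize(self):
        """Check the previous state, then persistently set the indicator."""
        trusted = self.contents_trusted()
        self.sector[0] = 1  # models a synchronous write of the sector
        return trusted

    def on_orderly_shutdown(self):
        """Call only after all outstanding I/Os have completed."""
        self.sector[0] = 0


sector = bytearray(512)
first_boot = DirtyShutdownFlag(sector).on_initialize()   # clean start
after_crash = DirtyShutdownFlag(sector).on_initialize()  # no orderly shutdown ran
```

In this sketch the second initialization reports that the contents cannot be trusted, because no orderly shutdown cleared the indicator in between.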

The technique of writing only to an integral number of erase blocks can also be advantageous in many other applications. One example is when a non-volatile solid-state storage device is used as a circular log. FIG. 4 is a flow chart illustrating a process 400 for processing a write request submitted to a non-volatile solid-state memory device utilized as a circular log to allow for an efficient, predictable process for utilizing the storage resource. In accordance with various embodiments of the present invention, one or more of the operations in process 400 can be implemented by various system components such as controller 210 in FIG. 2. Receiving operation 410 receives a write request to write data to a non-volatile solid-state storage device (e.g., a flash drive). Determination operation 420 determines the number of sectors needed to write the data to the storage device. For example, the number of sectors can be determined by rounding up to the nearest integer the result from dividing the data size by the size of a sector.

In some embodiments, the sectors of multiple write blocks have been logically divided (e.g., by a sector manager or a controller) into a size corresponding to an integral multiple of a size of an erase block of the non-volatile solid-state storage device. The sectors associated with the circular log can be ordered so that targeting operation 430 targets each successive write operation to the next sequential sector range on the storage device. Once the end of the device is reached, targeting operation 430 can start again at the beginning of the sectors. For example, if w is 2^k, then sectors 0 . . . (w−1) are written first, followed by sectors w . . . (2w−1), and then by sectors 2w . . . (3w−1) until the logical end of the drive has been reached. At that point, the targeting operation 430 may circle back and target write operations to sectors 0 . . . (w−1).
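The sector-count computation and the wrap-around targeting described above may be sketched as follows. The names sectors_needed and CircularLog are hypothetical, and the sketch assumes that the device size is an exact multiple of the write-block size w.

```python
def sectors_needed(data_bytes, sector_bytes=512):
    """Round the data size up to a whole number of sectors."""
    return (data_bytes + sector_bytes - 1) // sector_bytes


class CircularLog:
    """Hypothetical targeting of successive write blocks of w sectors."""

    def __init__(self, total_sectors, w):
        if total_sectors % w:
            raise ValueError("device size assumed to be a multiple of w")
        self.total_sectors = total_sectors
        self.w = w
        self.next_start = 0

    def next_range(self):
        """Return (first, last) sector of the next write block, wrapping
        to sector 0 once the logical end of the device is reached."""
        start = self.next_start
        self.next_start = (start + self.w) % self.total_sectors
        return start, start + self.w - 1


log = CircularLog(total_sectors=12, w=4)
ranges = [log.next_range() for _ in range(4)]
```

With a 12-sector device and w = 4, the successive targets in this sketch are sectors 0 to 3, 4 to 7, and 8 to 11, after which the fourth write circles back to sectors 0 to 3.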

Once the sector(s) have been determined, allocation operation 440 writes out the data entry currently stored in the next sequential sector(s), changes the state of the associated cells to invalid, and allocates the cells to allow data to be written. Then, writing operation 450 writes the data from the write request to the sector(s).

Once the non-volatile solid-state storage device is full, the circular log wraps around to the beginning of the logical sectors. At this point, old metadata and data entries in the log must be erased (since a solid-state device does not allow writes to locations that are currently being used without first erasing them) before they are overwritten using allocation operation 440. The term “page replacement policy” is used to refer to a policy or strategy for cleaning these old log entries. Some commonly known page replacement policies, such as CLOCK, may save a subset of the old metadata and data entries. Other page replacement policies, such as First In First Out (FIFO), will evict all entries a priori. Some embodiments adaptively issue log cleaning operations based on the page replacement policy, as described in more detail with reference to FIG. 5.

Some embodiments may issue log cleaning operations based on the page replacement policy as described now with reference to FIG. 5. FIG. 5 is a flow chart illustrating a process 500 for operating a non-volatile solid-state memory device based on the type of page replacement policy. Depending on the type of page replacement policy, the old entries may need to be read before they are evicted. The operations in process 500 may be performed by the controller within the non-volatile solid-state memory device or a processor associated with the memory device or storage system. Page replacement determination operation 510 determines the page replacement policy of the storage system. Then, decision operation 520 determines, based on the type of page replacement policy, whether the page replacement policy saves data (as opposed to evicting all data).

If the page replacement policy is one that may save a subset of the old metadata and data entries, then decision operation 520 branches to reading operation 530. Reading operation 530 reads the old entries into memory (e.g., DRAM 230 in FIG. 2) and evaluates them against the page replacement policy. Evaluation operation 540 determines if the old entries read into memory should be saved. In some embodiments, reading operation 530 can read the old entries from the flash storage device into memory before they are needed for processing. These entries can be read ahead of time in some embodiments where the log is written sequentially, and therefore the log blocks containing these entries will not be written until after the entries have been evaluated against the page replacement policy. The number of read-ahead operations that are issued to the flash storage device can be dynamically controlled by a system administrator or intelligent management software.

If the page replacement policy is one that does not save the old metadata and data entries (i.e., evicts all entries), then decision operation 520 branches to eviction operation 550. Eviction operation 550 automatically evicts old data entries without reading the data. As a result, all log cleaning read operations are eliminated since they are unnecessary. Instead of evaluating the old entries, the page replacement policy evicts the old entries from the in-memory data structures. Eliminating these unnecessary read operations improves the performance of the flash storage device.
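The branch taken at decision operation 520 might be sketched as follows. This is an illustrative sketch only: the enum, the function name, and the callback parameters read_block and should_keep are hypothetical stand-ins for reading operation 530 and evaluation operation 540.

```python
from enum import Enum
from typing import Callable, List


class Policy(Enum):
    FIFO = "fifo"    # evicts every old entry without inspecting it
    CLOCK = "clock"  # may retain a subset of the old entries


def clean_log_block(
    policy: Policy,
    read_block: Callable[[], List[str]],
    should_keep: Callable[[str], bool],
) -> List[str]:
    """Return the old entries to retain before a log block is overwritten.

    For an evict-all policy, read_block is never called, so no log
    cleaning read is issued to the storage device.
    """
    if policy is Policy.FIFO:
        return []
    # Policies that may save entries must read and evaluate them first.
    old_entries = read_block()
    return [entry for entry in old_entries if should_keep(entry)]
```

The design point illustrated here is that the read callback is invoked only on the path where the policy can actually retain data.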

In some embodiments, write and/or read operations to/from a non-volatile solid-state memory device are queued. The utilization of the memory device can be estimated as a function of properties of the queue (e.g., depth, rate of change of the queue depth, etc.). If, for example, the memory device is being used as a cache and the utilization is too high, it may be advantageous to bypass the cache, a process referred to as “shedding” the read or write operation. FIG. 6 is a flow chart illustrating a process 600 for redirecting (shedding) write requests and/or read requests from a write queue associated with a non-volatile solid-state memory device. One or more of the operations associated with process 600 may be performed by a processor, the storage operating system, the FTL, or other hardware component. When multiple write operations are queued to the non-volatile solid-state storage device, monitoring operation 610 observes the write (or read) queue depth, from which the utilization of the storage device may be inferred.

Estimation operation 620 generates an estimate of the utilization of the non-volatile solid-state memory device. The estimate may be based, at least in part, on the write queue depth determined by monitoring operation 610. For example, if the finite pool of write buffers is nearly empty (i.e., the queue depth is large), the non-volatile solid-state memory device may be inferred to be operating at nearly 100% utilization. If the finite pool of write buffers is nearly full (i.e., the queue depth is small), the non-volatile solid-state memory device may be inferred to be operating at nearly 0% utilization. When operating near 100% utilization, a newly submitted I/O request will experience a long response time due to the queuing delay incurred by waiting for the completion of the I/Os preceding it in the queue.
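One simple way estimation operation 620 might map the state of the buffer pool to a utilization estimate is sketched below. The linear mapping is an assumption made for illustration; the description above states only the two endpoints (a nearly empty pool implies nearly 100% utilization, a nearly full pool implies nearly 0%). The function name is hypothetical.

```python
def estimate_utilization(free_buffers: int, pool_size: int) -> float:
    """Estimate device utilization in the range 0.0 to 1.0.

    Assumption: utilization is taken to be the fraction of write
    buffers currently in use, i.e. the queue depth over the pool size.
    """
    if pool_size <= 0 or not 0 <= free_buffers <= pool_size:
        raise ValueError("free_buffers must lie between 0 and pool_size")
    queue_depth = pool_size - free_buffers
    return queue_depth / pool_size
```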

Once an estimate has been generated, first threshold decision operation 630 determines whether the estimate of the utilization exceeds a first threshold. If not, then decision operation 630 branches back to monitoring operation 610 to continue the monitoring of the write queue depth. If decision operation 630 determines that the first threshold has been exceeded, then operation 630 branches to second threshold decision operation 640 to determine if the estimate of the utilization exceeds a second threshold. If the second threshold is not exceeded, then decision operation 640 branches to write shedding operation 650. Both the first and second thresholds can be adaptively set based on monitored system performance, or set by a system administrator, by a storage operating system, or by another component or subsystem associated with a storage system.

Write shedding operation 650 redirects (sheds) write requests (e.g., non-essential writes such as those resulting from a read-miss) within the write queue to a secondary storage device. For example, for a storage system using flash storage devices as a cache for some backing store, or some other organization where data is available both in some other backing store and in the flash storage, the total I/O capacity (throughput) of the backing store may exceed that of the flash storage devices. The backing store at least provides some I/O capacity in addition to that of the flash storage devices. If the I/O demand exceeds the I/O capacity of the flash devices (that is, the I/O demand would exceed 100% utilization of the flash storage devices), the storage system can improve the overall performance of the system by shedding load from the flash storage devices and, where necessary, satisfying the I/O requests using the backing storage.

Specifically, in the context of a cache storage system using non-volatile solid-state devices (e.g., flash), the system may choose to shed load by discarding non-essential writes to the cache storage device when the size of the write queue to the storage device exceeds some threshold. An example of a non-essential write is a write entering new data in the cache storage device as a result of either a read-miss in the cache or a write to the cache. In the case of a write to some location already entered in the cache, while the data may not be written to the cache storage device, metadata invalidating the old cache entry must still be written to the flash storage device. However, this metadata is much smaller than the data, and so the net result is still a reduction in the I/O load on the flash storage device. Write requests that are not entered in the cache must be handled by writing the data directly to the backing storage. Read-misses that are not entered in the cache may simply be discarded.

If second threshold decision operation 640 determines that the size of the write queue to the flash storage device exceeds a second, larger threshold, the system may choose to shed additional load by discarding non-essential reads (in addition to non-essential writes) from the cache storage device using shedding operation 660. An example of a non-essential read is a flash read to satisfy a cache read-hit. The read request may instead be handled by reading from the backing store rather than the flash storage device, provided that the data requested does not uniquely reside in the cache (e.g., the cache uses a write-through or write-around policy, or the cache uses a write-back policy but the data requested is not marked as only available in the cache). Shedding operation 660 then branches to monitoring operation 610 where the write queue depth is monitored.
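The two-threshold decision made by operations 630 and 640 might be sketched as follows. The enum, the function name, and any particular threshold values are hypothetical; the description above leaves the thresholds to be set adaptively or by an administrator.

```python
from enum import Enum


class Shed(Enum):
    NOTHING = 0           # keep monitoring (back to operation 610)
    WRITES = 1            # shed non-essential writes (operation 650)
    WRITES_AND_READS = 2  # also shed non-essential reads (operation 660)


def shedding_level(utilization: float, first: float, second: float) -> Shed:
    """Pick a shedding level from a utilization estimate.

    Assumption: 'second' is the larger of the two thresholds.
    """
    if second < first:
        raise ValueError("second threshold must not be below the first")
    if utilization <= first:
        return Shed.NOTHING
    if utilization <= second:
        return Shed.WRITES
    return Shed.WRITES_AND_READS
```

For example, with illustrative thresholds of 0.7 and 0.9, an estimate of 0.8 would shed only non-essential writes, while an estimate of 0.95 would shed non-essential reads as well.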

In addition to shedding (or redirecting) write requests, some embodiments of the present invention can bypass the cache storage for non-essential operations. In a computer storage system, a storage cache frequently has a shorter service time than the backing store but a lower total I/O capacity (maximum throughput). In such systems, at some high level of I/O demand, the response time for accessing the cache will exceed the response time for directly accessing the backing store due to the large queuing delay in accessing the underlying storage for the cache. At this level of I/O demand, the system can provide better performance by bypassing the cache storage for non-essential operations. For example, rather than serving a read I/O request from the cache, the system may offer better performance by serving the read I/O request directly from the backing store.

Prior cache bypass mechanisms focus on CPU caches where there is no asymmetry between the service time and maximum throughput of the cache storage and the backing storage. In such caches, the intent of cache bypass mechanisms is to identify the addresses of data which will not be referenced again and to avoid caching such data, since caching it would pollute the cache with useless data and thereby reduce the cache hit ratio. In contrast, various embodiments discussed herein apply when there is an asymmetry between the service time and maximum throughput of the cache storage and the backing storage. These embodiments do not attempt to identify data which should not be cached by its address but instead make opportunistic decisions based on the current ratio of the total I/O demand on the cache versus the available I/O capacity of the cache storage device.

FIG. 7 is a flow chart illustrating a process 700 for improving the performance of host-side non-volatile solid-state device (e.g., flash) caches by bypassing the non-volatile solid-state storage devices when the load on the devices is so high that it is faster to obtain the data from the backing store. One or more of the operations associated with process 700 may be performed by a processor, the storage operating system, or bypass logic. In some embodiments, the bypass logic does not require coupling the cache logic to direct measurements of the I/O response time of the storage devices; it is self-contained within the cache algorithms.

As illustrated in FIG. 7, access estimation operation 710 estimates the expected response time for accessing the cache given the current level of I/O demand. Access estimation operation 710 can be performed continuously, periodically, on a pre-determined schedule, and/or upon the detection of one or more events. Bypass estimation operation 720 estimates the expected response time for bypassing the cache and directly accessing the backing store. Techniques for estimating these values are described below. If comparison operation 730 determines that the expected response time for accessing the cache is less than or equal to the expected response time for directly accessing the backing store, then comparison operation 730 branches to cache access operation 750 where the cache is accessed. If comparison operation 730 determines that the expected response time for accessing the cache is greater than the expected response time for directly accessing the backing store, then comparison operation 730 branches to determination operation 760 to determine if the cache access is essential.

The cache access may be essential to the correct operation of the cache, in which case it does not bypass the cache. One example of an essential operation is the invalidation of existing data in the cache when that data is overwritten. Another example of an essential operation is the reading of data from a write-back cache when the cache contains the only current copy of the data. One example of a non-essential operation is the insertion of data in the cache when the data is written. Another example of a non-essential operation is the reading of data from either a write-back or a write-through cache when a current copy of the data exists in the backing storage. If the cache access is not essential, then determination operation 760 branches to bypass operation 770 where the system bypasses the cache and directly accesses the backing store. For both read and write operations, if the cache access is essential, then determination operation 760 branches to cache access operation 750 where the cache is accessed.
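The decisions made by comparison operation 730 and determination operation 760 might be sketched as a single predicate. The function and parameter names are hypothetical, and the response times are assumed to be expressed in the same unit.

```python
def should_bypass_cache(
    cache_response_time: float,
    backing_response_time: float,
    essential: bool,
) -> bool:
    """Return True when the I/O should go directly to the backing store.

    Mirrors process 700: the cache is accessed whenever it is expected
    to be at least as fast as the backing store (operation 750), and
    essential operations never bypass the cache.
    """
    if cache_response_time <= backing_response_time:
        return False  # comparison operation 730 branches to 750
    return not essential  # determination operation 760
```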

In various embodiments, the expected response times for the cache storage and the backing storage may be estimated using models based on queuing theory with measurements made by the cache logic as input to the models. For example, the cache logic may maintain measurements of the mean response times and the current and mean queue depths for the cache and backing stores. Given a measured mean response time of t_r and a measured mean queue depth of N, we may use Little's Law to express the mean service time t_s as:

t_s = t_r / N

The expected instantaneous response time t_{ri} is then simply the current instantaneous queue depth N_i times the mean service time t_s:

t_{ri} = N_i t_s

The cache logic may compute these expected response times and direct the I/O to bypass the cache if the operation is non-essential to the cache and the expected response time for the backing store is less than that of the cache storage.
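The two queuing relations above might be computed as in the following sketch. The function names are hypothetical, and the numbers in the usage note are illustrative values rather than measurements.

```python
def mean_service_time(mean_response_time: float, mean_queue_depth: float) -> float:
    """Mean service time: measured mean response time divided by the
    measured mean queue depth."""
    if mean_queue_depth <= 0:
        raise ValueError("mean queue depth must be positive")
    return mean_response_time / mean_queue_depth


def expected_response_time(current_queue_depth: int, service_time: float) -> float:
    """Expected instantaneous response time: current instantaneous
    queue depth multiplied by the mean service time."""
    return current_queue_depth * service_time
```

For example, a measured mean response time of 8.0 ms at a mean queue depth of 4 gives a mean service time of 2.0 ms; with 10 requests currently queued, the expected instantaneous response time would be 20.0 ms.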

In some embodiments, the expected response time may be estimated using a model along with measurements of a sampling of the response times for the cache storage and/or backing storage rather than complete statistics. For example, rather than instrumenting all operations on the cache store, the cache logic may instrument only the write operations. This reduces the CPU overhead cost of the bypass algorithm since only a fraction of the operations incur the overhead cost of the performance measurement code.

In other embodiments, the expected response time may be qualitatively estimated using a model and measurements of a sampling of the response times for the cache storage and/or backing storage, rather than quantitatively estimated. The expected response time is large when the current instantaneous queue depth of the sampled subset of the operations is large (i.e., when there are a large number of outstanding sampled operations). When the expected response time of the cache storage is large and the expected response time of the backing store is small, the cache logic may direct the I/O to bypass the cache if the operation is non-essential to the cache.

The expected response time for the cache storage may be estimated using one of the methods described above, but the expected response time for the backing storage may be estimated as a constant that is based on performance measurements made in a laboratory using a representative I/O workload.

The effect of the cache bypass on the cache hit ratio may be moderated by preferentially choosing which types of non-essential operations to bypass. As the expected response time of the cache storage approaches that of the backing store, the bypass logic first chooses to bypass less-preferred operations such as insertion of newly written data into the cache, but does not bypass more-preferred operations such as servicing read hits from the cache. When the expected response time of the cache storage exceeds that of the backing store, the bypass logic bypasses all non-essential operations, leaving only the operations that are essential to the correct operation of the cache.

In various embodiments, the cache manages all writes to the cache storage device as appending to a circular log. It maintains a fixed-size pool of free buffers to be filled and then asynchronously written to the log on the cache storage device. The number of write operations in progress at the cache storage device can be inferred by subtracting the current number of free buffers in the pool from the size of the pool, and as described above can be used to qualitatively estimate the response time for the cache storage. When the number of free buffers becomes less than a certain threshold (determined heuristically, via modeling, or through experimental measurements), the cache logic begins bypassing the insertion of newly written data in the cache, and the insertion of data from cache read-misses. It must still enter invalidations if the newly written data overwrites data already in the cache. When the number of free buffers becomes less than a second, lower threshold, the cache logic begins bypassing read operations which hit in the cache, as long as the cache does not contain the only up-to-date copy of the data, in addition to continuing to bypass the insertion of newly written data in the cache and the insertion of data from cache read-misses. One advantage of these embodiments is that they do not require the overhead of collecting response time statistics on each I/O. They also have the software structural advantage of accessing only data structures local to the cache write logging code, thus avoiding coupling between the cache software module and other external modules such as the device drivers for the cache storage and backing storage.
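The free-buffer policy described above might be sketched as follows. The operation labels, the function name, and the parameter names are hypothetical; the threshold values would be determined heuristically, via modeling, or through experimental measurements.

```python
# Hypothetical labels for the kinds of cache operation discussed above.
INSERT_WRITE = "insert_write"          # insertion of newly written data
INSERT_READ_MISS = "insert_read_miss"  # insertion of data from a read-miss
READ_HIT = "read_hit"                  # read that hits in the cache
INVALIDATE = "invalidate"              # invalidation of overwritten data


def bypass_cache(
    op: str,
    free_buffers: int,
    first_threshold: int,
    second_threshold: int,
    only_copy_in_cache: bool = False,
) -> bool:
    """Decide whether an operation bypasses the cache storage device.

    Assumption: second_threshold is the lower of the two thresholds.
    """
    if second_threshold > first_threshold:
        raise ValueError("second threshold must not exceed the first")
    if op == INVALIDATE:
        return False  # essential: invalidations must still be entered
    if op in (INSERT_WRITE, INSERT_READ_MISS):
        return free_buffers < first_threshold
    if op == READ_HIT:
        # Never bypass when the cache holds the only up-to-date copy.
        return free_buffers < second_threshold and not only_copy_in_cache
    raise ValueError(f"unknown operation: {op}")
```

Because the decision depends only on the free-buffer count, this sketch reads no state outside the write logging code, which reflects the decoupling advantage noted above.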

The techniques introduced here can be embodied as special-purpose hardware (e.g., circuitry), or as programmable circuitry appropriately programmed with software and/or firmware, or as a combination of special-purpose and programmable circuitry. Hence, embodiments may include a machine-readable medium having stored thereon instructions which may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, ROMs, random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing electronic instructions.

In this description, the phrases “in some embodiments,” “according to various embodiments,” “in the embodiments shown,” “in other embodiments,” and the like generally mean the particular feature, structure, or characteristic following the phrase is included in at least one embodiment of the present invention, and may be included in more than one embodiment of the present invention. In addition, such phrases do not necessarily all refer to the same embodiments.

While detailed descriptions of one or more embodiments of the invention have been given above, various alternatives, modifications, and equivalents will be apparent to those skilled in the art without varying from the spirit of the invention. For example, while the embodiments described above refer to particular features, the scope of this invention also includes embodiments having different combinations of features and embodiments that do not include all of the described features. Accordingly, the scope of the present invention is intended to embrace all such alternatives, modifications, and variations as fall within the scope of the claims, together with all equivalents thereof. Therefore, the above description should not be taken as limiting the scope of the invention, which is defined by the appended claims.

What is claimed is:
 1. A method comprising: maintaining a cache on a non-volatile solid-state storage device for data from a secondary storage device, wherein the data is written to the cache in integral multiples of an erase block size of the non-volatile solid-state storage device; monitoring a depth of a write queue in a memory to estimate a utilization of the non-volatile solid-state storage device; determining whether the estimate of utilization exceeds a first utilization threshold; and in response to determining that the estimate of utilization exceeds the first utilization threshold, shedding a write request having the data within the write queue to the secondary storage device, wherein the shed write request includes a non-essential write resulting from one of a read-miss in the cache and a write to the cache when a current copy of the data exists in the secondary storage device.
 2. The method of claim 1 further comprising: in response to determining that the estimate of utilization does not exceed the first utilization threshold, determining whether the estimate of utilization exceeds a second threshold; and in response to determining that the estimate of utilization exceeds the second utilization threshold, shedding a non-essential read request to the secondary storage device, wherein the shed non-essential read request includes a read request to satisfy a cache read-hit when a current copy of the data exists in the secondary storage device.
 3. The method of claim 1 further comprising: estimating a first expected response time for accessing the non-volatile solid-state storage device; estimating a second expected response time for accessing the secondary storage device; and accessing the secondary storage device to satisfy a non-essential input/output (I/O) request when the estimated second expected response time is less than the estimated first expected response time, wherein the non-essential I/O request is selected from one of a read-miss in the cache, a write to the cache when a current copy of the data exists in the secondary storage device, and a read request to satisfy a cache read-hit when a current copy of the data exists in the secondary storage device.
 4. The method of claim 3 wherein estimating the first expected response time of the non-volatile solid-state storage device further comprises: inferring a number of write operations in progress by subtracting a current number of free buffers within a buffer pool in the memory from a size of the buffer pool.
 5. The method of claim 3 wherein estimating the first expected response time further comprises determining an expected instantaneous response time of a current write request in the write queue as a multiplication of a mean service time of the write queue by an instantaneous queue depth.
 6. The method of claim 5 wherein the mean service time of the write queue is determined based on a mean service time of the current request as a ratio of a measured mean response time of the write queue to a measured mean depth of the write queue.
 7. The method of claim 1 further comprising: determining an expected response time of the non-volatile solid-state storage device; and in response to determining that the expected response time of the non-volatile solid-state storage device approaches an expected response time of the secondary storage device, bypassing the cache when the write request inserts new data into the cache.
 8. The method of claim 7 wherein read requests are serviced from the cache.
 9. The method of claim 7 further comprising: in response to determining that the expected response time of the non-volatile solid-state storage device exceeds the expected response time of the secondary storage device, bypassing the cache to service a read request from the secondary storage device.
 10. A system comprising: a processor; a memory coupled to the processor; a secondary storage device coupled to the processor; a non-volatile solid-state storage device coupled to the processor; and a process executing on the processor and configured to: maintain a cache on the non-volatile solid-state storage device for data from the secondary storage device, wherein the data is written to the cache in integral multiples of an erase block size of the non-volatile solid-state storage device; monitor a depth of a write queue in the memory to estimate a utilization of the non-volatile solid-state storage device; determine whether the estimate of utilization exceeds a first utilization threshold; and in response to determining that the estimate of utilization exceeds the first utilization threshold, shed a write request having the data within the write queue to the secondary storage device, wherein the shed write request includes a non-essential write resulting from one of a read-miss in the cache and a write to the cache when a current copy of the data exists in the secondary storage device.
 11. The system of claim 10 wherein the process executing on the processor is further configured to: in response to determining that the estimate of utilization does not exceed the first utilization threshold, determine whether the estimate of utilization exceeds a second threshold; and in response to determining that the estimate of utilization exceeds the second utilization threshold, shed a non-essential read request to the secondary storage device, wherein the shed non-essential read request includes a read request to satisfy a cache read-hit when a current copy of the data exists in the secondary storage device.
 12. The system of claim 10 wherein the process executing on the processor is further configured to: estimate a first expected response time for accessing the non-volatile solid-state storage device; estimate a second expected response time for accessing the secondary storage device; and access the secondary storage device to satisfy a non-essential input/output (I/O) request when the estimated second expected response time is less than the estimated first expected response time, wherein the non-essential I/O request is selected from one of a read-miss in the cache, a write to the cache when a current copy of the data exists in the secondary storage device, and a read request to satisfy a cache read-hit when a current copy of the data exists in the secondary storage device.
 13. The system of claim 12 wherein the process executing on the processor configured to estimate the first expected response time of the non-volatile solid-state storage device is further configured to: infer a number of write operations in progress by subtracting a current number of free buffers within a buffer pool in the memory from a size of the buffer pool.
 14. The system of claim 12 wherein the process executing on the processor configured to estimate the first expected response time is further configured to determine an expected instantaneous response time of a current write request in the write queue as a multiplication of a mean service time of the write queue by an instantaneous queue depth.
 15. The system of claim 14 wherein the mean service time of the write queue is determined based on a mean service time of the current request as a ratio of a measured mean response time of the write queue to a measured mean depth of the write queue.
 16. The system of claim 10 wherein the process executing on the processor is further configured to: determine an expected response time of the non-volatile solid-state storage device; and in response to determining that the expected response time of the non-volatile solid-state storage device approaches an expected response time of the secondary storage device, bypass the cache when the write request inserts new data into the cache.
 17. The system of claim 16 wherein read requests are serviced from the cache.
 18. The system of claim 16 wherein the process executing on the processor is further configured to: in response to determining that the expected response time of the non-volatile solid-state storage device exceeds the expected response time of the secondary storage device, bypass the cache to service a read request from the secondary storage device.
 19. The system of claim 10 wherein the process executing on the processor configured to maintain the cache on the non-volatile solid-state storage device is further configured to: write data to the cache as a circular log having entries aligned with erase blocks of the non-volatile solid-state storage device.
 20. A non-transitory computer readable medium having stored thereon program instructions for execution on a processor, the program instructions configured to: maintain a cache on a non-volatile solid-state storage device coupled to the processor for data from a secondary storage device coupled to the processor, wherein the data is written to the cache in integral multiples of an erase block size of the non-volatile solid-state storage device; monitor a depth of a write queue in a memory coupled to the processor to estimate a utilization of the non-volatile solid-state storage device; determine whether the estimate of utilization exceeds a first utilization threshold; and in response to determining that the estimate of utilization exceeds the first utilization threshold, shed a write request having the data within the write queue to the secondary storage device, wherein the shed write request includes a non-essential write resulting from one of a read-miss in the cache and a write to the cache when a current copy of the data exists in the secondary storage device.